Red Wine Exploration by Darren

Univariate Plots Section

Let’s try to understand the data structure first.

## [1] 1599   13
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
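The output above looks like the result of the usual base R inspection calls. A minimal sketch follows, assuming the data frame is named redWineData (the name that appears in the model calls later in this report). Only the first six rows of a few columns from the head() output are rebuilt here so that the example runs on its own; the real analysis would load the full 1599-row file instead.

```r
# First six rows copied from the head() output above (subset of columns).
redWineData <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L)
)

dim(redWineData)      # number of rows and columns
summary(redWineData)  # min, quartiles, mean and max per column
str(redWineData)      # column types and leading values
head(redWineData)     # first rows of the data
```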

Before any analysis, we check whether the data contain missing or infinite values that would need special handling.

##                    X        fixed.acidity     volatile.acidity 
##                FALSE                FALSE                FALSE 
##          citric.acid       residual.sugar            chlorides 
##                FALSE                FALSE                FALSE 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##                FALSE                FALSE                FALSE 
##                   pH            sulphates              alcohol 
##                FALSE                FALSE                FALSE 
##              quality 
##                FALSE
##                    X        fixed.acidity     volatile.acidity 
##                FALSE                FALSE                FALSE 
##          citric.acid       residual.sugar            chlorides 
##                FALSE                FALSE                FALSE 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##                FALSE                FALSE                FALSE 
##                   pH            sulphates              alcohol 
##                FALSE                FALSE                FALSE 
##              quality 
##                FALSE

Luckily, no column contains missing or infinite values, so no records need special handling.
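A per-column check of this kind could be written as below. This is a sketch rather than the code that produced the two tables above; the helper names has.na and has.inf are made up for illustration, and the sample rows are copied from the head() output.

```r
# Sample rows copied from the head() output (subset of columns).
redWineData <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  quality              = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# TRUE for a column if at least one value is missing / infinite.
has.na  <- sapply(redWineData, function(column) any(is.na(column)))
has.inf <- sapply(redWineData, function(column) any(is.infinite(column)))

print(has.na)
print(has.inf)
```

A FALSE for every column in both vectors matches the situation reported above.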

Our data consist of 1599 observations of 13 variables. As we ultimately want to find out the relationship between quality and the other variables, let’s take a look at the quality distribution first.

As total.sulfur.dioxide is the sum of free and bound sulfur dioxide, I would like to add a field for the bound form of SO2 to see whether it is of any use in the upcoming analysis.
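The derived field is a simple column subtraction. A minimal sketch, using the six sample rows from the head() output and the column name bound.sulfur.dioxide that appears in the standard deviation table later on:

```r
# Sample rows copied from the head() output.
redWineData <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40)
)

# Bound SO2 is whatever part of the total is not free.
redWineData$bound.sulfur.dioxide <-
  redWineData$total.sulfur.dioxide - redWineData$free.sulfur.dioxide

print(redWineData)
```

For the first row this gives 34 - 11 = 23.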

Also, we will plot a histogram of every variable to pinpoint interesting fields that deserve a closer look.

From the histograms, we can see that a number of fields either have values spread rather evenly across their range or show little variation at all. To check the latter, we calculate the standard deviation of each column.

##                    X        fixed.acidity     volatile.acidity 
##         4.617359e+02         1.741096e+00         1.790597e-01 
##          citric.acid       residual.sugar            chlorides 
##         1.948011e-01         1.409928e+00         4.706530e-02 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##         1.046016e+01         3.289532e+01         1.887334e-03 
##                   pH            sulphates              alcohol 
##         1.543865e-01         1.695070e-01         1.065668e+00 
##              quality bound.sulfur.dioxide 
##         8.075694e-01         2.705628e+01
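A table like the one above can be produced by applying sd() to every column. The sketch below does that on six sample rows from the head() output, so its numbers will not match the full-data values printed above; column.sd is a name chosen for illustration.

```r
# Sample rows copied from the head() output (subset of columns).
redWineData <- data.frame(
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075),
  density   = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
)

# One standard deviation per column, returned as a named vector.
column.sd <- sapply(redWineData, sd)
print(column.sd)
```

Note that a standard deviation is expressed in the units of its own column, so comparing values across columns with very different scales is only a rough guide.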

As we can see, density and chlorides have very small standard deviations in absolute terms, so we will set them aside in the upcoming analysis, suspecting that they do not contribute much to quality. Whether they actually correlate with quality is left to the bivariate analysis.

We would also like to know more about each field of interest (those that the dataset description says affect taste, i.e. volatile acidity, citric acid, residual sugar and free sulfur dioxide), so let’s plot these fields individually.

The volatile acidity histogram looks roughly normal. Citric acid seems fairly uniformly distributed across its levels. The residual.sugar and free.sulfur.dioxide histograms look skewed. Let’s plot them on a log scale and see if anything interesting appears.

We can see that free.sulfur.dioxide looks much more normal after the log transformation, so we should analyse free.sulfur.dioxide on a log scale going forward.

For residual.sugar, the log-scaled distribution is similar to the original one. In fact, if we look at the original closely, the bulk of the distribution is not that skewed; it looks skewed only because of some outliers on the right side. I guess those are the especially fruit-flavoured wines.

With this finding, it also makes sense to plot histograms of total and bound sulfur dioxide. Both are also very skewed. Let’s plot them on a log scale.

Both now look approximately normal. So we will create new log-scaled fields for free.sulfur.dioxide, bound.sulfur.dioxide and total.sulfur.dioxide.
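Creating the log-scaled fields might look like the sketch below. Two things here are assumptions: the base of the logarithm (log10 is used, but the report does not say which base was chosen) and the name bound.sulfur.dioxide.log, which is modelled on the free.sulfur.dioxide.log and total.sulfur.dioxide.log names used later in the report. The sample rows come from the head() output.

```r
# Sample rows copied from the head() output.
redWineData <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40)
)
redWineData$bound.sulfur.dioxide <-
  redWineData$total.sulfur.dioxide - redWineData$free.sulfur.dioxide

# Assumption: base-10 logarithm. All values are positive here, so the
# transform is well defined; a zero would need separate handling.
redWineData$free.sulfur.dioxide.log  <- log10(redWineData$free.sulfur.dioxide)
redWineData$bound.sulfur.dioxide.log <- log10(redWineData$bound.sulfur.dioxide)
redWineData$total.sulfur.dioxide.log <- log10(redWineData$total.sulfur.dioxide)

print(redWineData)
```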

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wine records in the dataset with 11 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol) and 1 output variable (quality). The remaining column, X, is just a row index.

All of the features are numeric, while quality is an integer.

Other observations:

1. quality ranges from 3 to 8, with no extreme values of 0-2 or 9-10.
2. quality roughly follows a normal distribution.
3. 75% of observations have a chlorides level of at most 0.09, and surprisingly there are samples with almost 7 times this value.

What is/are the main feature(s) of interest in your dataset?

Based on the dataset description and some research, I think volatile acidity, citric acid, residual sugar and free sulfur dioxide will be the main features for predicting quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I suspect that fixed acidity and pH are highly related, or even represent very similar trends. As they affect the taste, I would expect them to contribute partly to the quality score.

Sulphates and the three SO2 fields should also be correlated. We will perform a correlation analysis in the next part to confirm these relationships.

Did you create any new variables from existing variables in the dataset?

I created a new field for the bound form of sulfur dioxide, which is (total.sulfur.dioxide - free.sulfur.dioxide). We don’t yet know whether the bound form of SO2 is of interest or correlates with quality, so we will check with a correlation analysis in the next section.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Luckily, there are no missing values or infinite values in our dataset.

The sulfur dioxide fields are skewed, so we plotted them on a log scale. Log-scaled, they are distributed more normally, so we also created log-scaled versions of these fields.

However, I did find that the density and chlorides fields have very small variation, so I would exclude them from further analysis, as I doubt that such small variation contributes to differences in taste.

Bivariate Plots Section

First, we calculate the correlation matrix before going ahead with more detailed analysis.
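A correlation matrix of this kind can be computed with cor() on the numeric columns. The sketch below uses only the first ten observations shown in the str() output, so its values will differ from those of the full dataset; it is meant to show the call, not to reproduce the findings. The name cor.matrix is chosen for illustration.

```r
# First ten observations copied from the str() output (subset of columns).
redWineData <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns.
cor.matrix <- cor(redWineData)
print(round(cor.matrix, 2))
```

Even on this small sample, the fixed.acidity and pH entry comes out negative, in line with the direction discussed in this section.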

Let’s look at the guesses we made in part 1: that fixed acidity and pH are highly related, and that sulphates is related to the three SO2 fields. It turns out that part of our guess is correct: fixed acidity and pH do have a very strong negative correlation. However, there are no significant correlations between sulphates and the three sulfur dioxide fields.

Another focus was other high correlations between features. We found high positive correlations between fixed.acidity and both density and citric.acid. Also, citric.acid is fairly negatively correlated with volatile.acidity and pH.

While sulphates is not particularly correlated with the three sulfur dioxide fields, those three fields (in both log and non-log scaled versions) are highly correlated with one another.

So to conclude, I would drop a number of fields from further analysis, i.e. I will take total.sulfur.dioxide.log to represent all three sulfur dioxide fields, and pH to represent all acid-related fields.

To be more confident about these representative fields, we will plot the sulfur dioxide fields and the acidic fields against them and look at their trends.

These two graphs confirm our view that total.sulfur.dioxide.log can stand in for both the free and bound sulfur dioxide features.

These three plots also confirm our view that fixed.acidity and citric.acid can be represented by the pH feature. While volatile.acidity shows a positive correlation with pH, that might simply reflect that the higher the volatile acidity, the lower the fixed acidity. Either way, pH is a nice indicator of the acidity of the wine. Looking more closely at the correlations between pH and the other fields, it is not surprising that pH also has a minor correlation with the acidic sulfur dioxide fields.

This concludes the first part of the bivariate analysis, which combined related features. Next, we move on to finding the features that are critical for predicting wine quality.

We will now look at the correlation between quality and other features.

We immediately notice the strong negative relationship between quality and volatile.acidity. However, there is no strong correlation between pH and quality. Thus, the overall acidity indicator (pH) that we came up with in the last part is not really useful for predicting quality. In turn, we should put less weight on the other acidic features and focus on volatile acidity.

It is also easy to spot that alcohol level has a strong positive relationship with quality.

Therefore, in the next part, volatile.acidity and alcohol will be the main features we use to predict wine quality.

According to the dataset description, two fields are particularly important in terms of taste or flavour: citric.acid and residual.sugar. One provides a fresh taste and the other a sweet taste. Surprisingly, sweetness (represented by residual.sugar) is not really correlated with quality, while freshness (represented by citric.acid) has a weak effect on quality.

We will now draw scatter plots for our two main features of interest and for one side feature, citric acid.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

There are two parts to the bivariate analysis. The first part concerns duplicated or redundant fields. We concluded that the pH field can act as an integrated indicator of multiple acidic fields, including fixed.acidity, volatile.acidity and citric.acid. However, this pH indicator doesn’t really correlate with the quality variable, so we set it aside.

We also found that the three sulfur dioxide fields can be represented by the total.sulfur.dioxide.log field, but this field also has only a weak correlation with the quality variable.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

While we selected volatile acidity, citric acid, residual sugar and free sulfur dioxide as the main features in the univariate analysis, we need to change our main features to alcohol and volatile acidity after analysing the correlation between quality and the other fields.

Only alcohol and volatile acidity show a significant relationship with quality in the correlation analysis.

Of course, we also found other interesting relationships: density is strongly positively correlated with fixed.acidity but negatively correlated with alcohol level. After some research, this is because alcohol has a lower density than water, while the acids have a higher density than water.

What was the strongest relationship you found?

The strongest relationship I found is between pH and fixed.acidity. This makes a lot of sense, as the other acidic features are present only in small amounts that hardly affect the overall pH value. This might also explain why pH doesn’t really predict the overall quality of wine: pH is dominated by fixed.acidity, which most experts didn’t consider a critical factor contributing to quality. In fact, volatile.acidity is the main acidic factor affecting quality, but it is under-represented in the pH value.

Multivariate Plots Section

In this section, we focus on how to predict the quality of the wine. As we have already identified the two main features, we first plot them against the quality score.

From the plot above, it is easy to see that most of the good quality wines sit in the lower right part of the graph, indicating that the higher the alcohol level and the lower the volatile.acidity, the better the quality of the wine.

We will also build a mathematical model of the data using the two main features, and make predictions on the existing dataset with it. (Note: we are not splitting the data into a training set and a test set, as we are not doing machine learning here. We just want to explore and understand the dataset better.)

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity, data = redWineData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59342 -0.40416 -0.07426  0.46539  2.25809 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.09547    0.18450   16.78   <2e-16 ***
## alcohol           0.31381    0.01601   19.60   <2e-16 ***
## volatile.acidity -1.38364    0.09527  -14.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6678 on 1596 degrees of freedom
## Multiple R-squared:  0.317,  Adjusted R-squared:  0.3161 
## F-statistic: 370.4 on 2 and 1596 DF,  p-value: < 2.2e-16
##  predicted.absError 
##  Min.   :0.0003412  
##  1st Qu.:0.2321735  
##  Median :0.4236706  
##  Mean   :0.5266607  
##  3rd Qu.:0.7295434  
##  Max.   :2.5934155

With this simple model, we achieved a mean absolute error of 0.53, which is reasonably accurate without any advanced machine learning techniques. We might also want to plot the absolute errors and see whether this low mean absolute error arose just because of biased data.

From this plot, we can see that most of the predictions have an error of less than 1. In fact, according to the summary above, 75% of predictions have an error below 0.73. But it is also worth mentioning that, according to the R^2 value, only 31.7% of the variance of quality is explained by this model.
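The fit and the error summary can be sketched as follows. The formula and the names redWineData and predicted.absError are taken from the output above; the intermediate column name predicted is an assumption. Only the first ten observations from the str() output are used, so the coefficients will not match the full-data fit reported above.

```r
# First ten observations copied from the str() output (subset of columns).
redWineData <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Ordinary least squares with the two main features.
model <- lm(quality ~ alcohol + volatile.acidity, data = redWineData)

# In-sample predictions and their absolute errors
# (no train/test split, as noted in the text).
redWineData$predicted <- predict(model, newdata = redWineData)
redWineData$predicted.absError <-
  abs(redWineData$predicted - redWineData$quality)

print(summary(model))
print(summary(redWineData$predicted.absError))
```

The mean of predicted.absError is the mean absolute error quoted in the text, and its third quartile is the bound that 75% of the predictions stay under.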

In the bivariate analysis, we also mentioned that citric.acid could be another minor factor contributing to the model. Let’s take that into consideration.

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + citric.acid, 
##     data = redWineData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59992 -0.40354 -0.07282  0.47165  2.23655 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.05533    0.19433  15.722   <2e-16 ***
## alcohol           0.31384    0.01601  19.602   <2e-16 ***
## volatile.acidity -1.34286    0.11362 -11.818   <2e-16 ***
## citric.acid       0.06779    0.10291   0.659     0.51    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6679 on 1595 degrees of freedom
## Multiple R-squared:  0.3172, Adjusted R-squared:  0.3159 
## F-statistic:   247 on 3 and 1595 DF,  p-value: < 2.2e-16
##  predicted.model2.absError
##  Min.   :0.0003412        
##  1st Qu.:0.2321735        
##  Median :0.4236706        
##  Mean   :0.5266607        
##  3rd Qu.:0.7295434        
##  Max.   :2.5934155

Neither the statistics nor the absolute error plot shows a significant improvement after adding citric acid as a factor (its coefficient has a p-value of 0.51). So we will stop here and settle on the two-factor model for now. (Note: neither model is a good model for predicting the quality of red wine, as they explain less than 40% of the variance. We could create a better model with machine learning techniques in the future, which is out of the scope of this project.)

Before we conclude the multivariate analysis section, we will do a bit more analysis on the features to see if there is anything interesting. A while ago we were interested in using pH to represent all the acidic features; let’s see if that’s accurate.

These two plots confirm that the pH value is dominated by fixed.acidity and not really affected by the other acidity values.

Another thing worth looking into is sulphates and free.sulfur.dioxide.log, both of which help to preserve the wine by preventing microbial growth and oxidation. Is quality affected by these two factors?

Interestingly, there seems to be no relationship between these wine preservatives and quality. Maybe the winemakers tune the level of preservative to a suitable amount, so that a higher level of preservative doesn’t mean the wine is better preserved.

So what about the level of volatile acidity and the preservatives?

While there is no obvious trend between the preservatives and volatile acidity, there seems to be a small group of wines with high volatile acidity and low values of both sulphates and free.sulfur.dioxide.log. Let’s try to create a preservative index by normalizing both preservative fields and adding them up.
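One way to build such an index is sketched below. It assumes that "normalizing" means z-score standardization via scale() and that the log field uses base 10; the report does not spell out either choice, and the column name preservative.index is made up for illustration. The ten sample observations come from the str() output.

```r
# First ten observations copied from the str() output.
redWineData <- data.frame(
  sulphates           = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  free.sulfur.dioxide = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
)
redWineData$free.sulfur.dioxide.log <- log10(redWineData$free.sulfur.dioxide)

# Assumption: z-score each field (mean 0, sd 1) so that both contribute
# on a comparable scale, then add them up.
z.sulphates <- as.numeric(scale(redWineData$sulphates))
z.free.log  <- as.numeric(scale(redWineData$free.sulfur.dioxide.log))
redWineData$preservative.index <- z.sulphates + z.free.log

print(redWineData)
```

With this construction the index is centred on 0, so a value of around 1 would mean the two preservatives together sit about one standard deviation above average.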

And here comes the relationship between preservatives and volatile acidity: preservatives do help to preserve the wine by preventing excess volatile acidity. However, once the level surpasses a critical value (a preservative index of around 1), adding more preservative doesn’t help to improve the quality or taste of the wine. So do sulphates and free.sulfur.dioxide.log complement each other?

From this plot, we don’t see any complementary relationship between sulphates and free.sulfur.dioxide.log. In fact, they don’t seem to have any relationship at all.

As a lot of people prefer a sweeter taste, let’s also look at the relationship between citric.acid and residual.sugar.

From this plot, it’s clear that the experts evaluating these wines have no particular preference for sweetness.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Two chemicals, free sulfur dioxide and sulphates, contribute to the preservation of the wine, but their amounts don’t really affect quality. If we hold the free sulfur dioxide level constant, higher sulphates go with better quality, but overall neither factor contributes much to the quality of the wine.

I also looked into the experts’ preferences for sweetness but found no particular trend. Apparently the experts don’t really have a preference for the sweetness or freshness of the wine.

Were there any interesting or surprising interactions between features?

We also looked at the relationship between preservatives and volatile.acidity (a critical determinant of quality). Preservatives do help to preserve the wine by preventing excess volatile acidity. However, once the level surpasses a critical value (a preservative index of around 1), adding more preservative doesn’t help to improve the quality or taste of the wine.

I originally thought that the two preservative features must be related somehow, either positively or as complements, but even after standardizing the two preservative features they don’t seem to have any relationship.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I created a linear model based on the two main features that we identified, alcohol level and volatile acidity. The strength of the model is that it predicts the quality score of 75% of the wines to within 0.73 of the actual score, and about half of them to within 0.42 (on an integer scale). However, the R^2 statistic shows that only about 32% of the quality variance is explained by this model. We could create a better model with machine learning techniques in the future, which is out of the scope of this project.


Final Plots and Summary

Plot One

Description One

The log-scaled sulfur dioxide plots show the values that we used to analyse sulfur dioxide levels. In the univariate analysis, we found that all three sulfur dioxide fields are right-skewed; plotting log-scaled versions of the three fields brought them closer to a normal distribution. We then created three new fields holding the log values for further analysis.

Plot Two

Description Two

The plot of the three acidic features against fixed.acidity, coloured by pH value, shows an interesting fact: the pH value of the wine is dominated by fixed.acidity. All the other acidic chemicals seem to have no significant impact on pH. That makes a lot of sense in terms of chemistry. total.sulfur.dioxide is measured in mg/dm^3, while the other acidic fields are measured in g/dm^3: citric acid and volatile acidity are on the order of 10^-1 g/dm^3, while fixed.acidity is on the order of 10^1 g/dm^3. Assuming the conversion from mass to molar concentration differs by less than a factor of 100 between these acids, fixed acidity will have the dominant impact on acidity. More interesting still is that volatile acidity contributes so much to the quality score. After some research on Wikipedia, the reason is that acetic acid has a distinctive pungent smell and is in fact a main component of vinegar.

Plot Three

Description Three

This plot shows the relationship between the two main features and quality. It supports our view that alcohol and volatile.acidity help predict wine quality. We ended up with the model quality = 0.31381 * alcohol - 1.38364 * volatile.acidity + 3.09547, which predicts 75% of our data with an absolute error below 0.73 in quality.
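As a worked example, the fitted equation can be applied by hand to the first wine in the dataset (alcohol 9.4, volatile.acidity 0.70, actual quality 5). The function name estimateQuality is hypothetical; the coefficients are the ones reported by the two-feature model.

```r
# Coefficients taken from the two-feature linear model in this report.
estimateQuality <- function(alcohol, volatile.acidity) {
  3.09547 + 0.31381 * alcohol - 1.38364 * volatile.acidity
}

# First observation: alcohol = 9.4, volatile.acidity = 0.70, quality = 5.
predicted <- estimateQuality(alcohol = 9.4, volatile.acidity = 0.70)
abs.error <- abs(predicted - 5)

print(predicted)  # about 5.08
print(abs.error)  # about 0.08
```

This particular wine is predicted well; the error summary in the multivariate section shows how the errors are spread over the whole dataset.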


Reflection

This red wine dataset is fairly small, with only 1599 observations. However, there was a lot to investigate in its 12 variables. I first plotted a histogram of every variable to get an overview of the data, and that went quite well. However, this didn’t help me identify the features related to quality; that had to be done in the bivariate analysis. Most of the features that I identified as important in the univariate analysis section turned out not to be really meaningful, while a variable I ignored (alcohol) turned out to be very important.

I was also surprised that factors that might seem important to a lot of people, like sweetness, are not that important in the experts’ eyes. At the same time, alcohol level, which seemed not that important for quality, was an extremely important predictor.

The model and investigations in this project treat the variables as individual features, i.e. the model doesn’t use new features built by combining multiple features. Thus, the model is too simple to give an accurate prediction. According to the R^2 score, less than 40% of the variance of quality is explained by this simple model.

To take this dataset further, we could employ more advanced machine learning techniques. Algorithms like neural networks can combine multiple features into a more comprehensive mathematical model and take more factors into account.